So, just some historical introduction. The name Zenodo comes from Zenodotus, who was the first head librarian of the Library of Alexandria.
And the platform itself was developed under the OpenAIRE program, as a tool to support open access.
And it was, and still is, operated by CERN, so it's hosted on the CERN computing infrastructure. CERN, as you probably know, is the European center for nuclear and particle physics research, located next to Lake Geneva. Zenodo was established in 2013 as a pure document repository and was later upgraded to also allow data.
So, technically, as I already said, the service is hosted on the CERN data center infrastructure, and Zenodo stores metadata and data separately. All files, that is the actual data sets you upload to Zenodo, are stored on the CERN data storage system: a portion of 20 petabytes or so is currently allocated for Zenodo, and the remaining hundreds of petabytes are for CERN research. The metadata is stored separately, in a PostgreSQL database.
The software the whole system runs on is called Invenio. It was also originally developed by CERN and is now an open source project with multiple contributing institutions. Invenio is very popular in the high energy physics community: INSPIRE-HEP, which is the main bibliographic database there, uses it, and so does the Caltech library. And Zenodo is basically the research data management part of the Invenio project.
Just a few features before we go into the actual system. First of all, very important to notice: it is free of charge. You don't have to pay anything, at least as long as you stay below the 50 gigabyte limit for a single data publication. Going above that is possible if you don't exceed something like 100 gigabytes, but then they usually ask for a contribution to their costs. And you probably also have to argue why you really need 100 gigabytes, so this is a case-by-case evaluation by Zenodo. If you want to go above 400 gigabytes, they may well say no.
You can submit your data set just on its own, or you can also choose to put it under the umbrella of a Zenodo community. We will have a look later on at how those look.
And very important to see: Zenodo has a very nice DOI service. You've probably heard of DOIs; everyone has seen one. Down here at the very bottom of the screen is a screenshot from arXiv; I just chose one publication about computing that I found on the day I made the slide. And you can see here: nowadays, arXiv also provides DOIs. That wasn't the case back in my time, years ago. So the arXiv ID is essentially upgraded to a DOI; this publication now has a DOI. And a DOI is a persistent identifier that guarantees that, no matter what happens, using this ID you will always find this digital object again on the web, in this case somewhere on arXiv.
And Zenodo has a very clever way of assigning DOIs to data sets, and we will look at that later on; this is a very nice feature, at least in my opinion. Further features we will look at are non-static metadata and also different access levels. So even though Zenodo was designed as an open access platform, you don't have to make the data open. If you put data on Zenodo, the only thing you need to make open is your metadata, which is, again, stored separately from the actual data. This metadata gets a completely free license, CC0, which is functionally equivalent to public domain, though not legally identical. Basically you can do anything you want with this metadata, and that is necessary to ensure that the data you publish is FAIR compliant. FAIR is this famous acronym, findable, accessible, interoperable, reusable, which you may have heard; it is essentially the standard the European Union expects of all data sets that come from European Union funded projects.
Zenodo is built in such a way that three of the four criteria, findable, accessible, and reusable, so the F, the A, and the R, are basically fulfilled by Zenodo alone; you don't have to worry about them. The interoperable part means your data set has to have a clear structure and follow certain standards, so that it is easily machine understandable, and that's something you still have to do yourself: Zenodo doesn't touch your data sets, they stay the way you upload them. We will also look shortly at the GitHub integration of Zenodo. The last points are just that, for EU projects, Zenodo will automatically register the metadata in OpenAIRE, that Zenodo doesn't have any particular tastes when it comes to file formats and types, and that you are always responsible for your upload. So if you violate copyright or data protection, that's on you.
So I'm already logged in; in my case, I logged in via my ORCID. We can have a look at the front end, but it's convenient to just look at one data set, and I chose one that is special because it has a lot of downloads (it was downloaded over 200,000 times) and it is updated relatively often. It's about Twitter chatter about COVID, it's a free data set, and it's open access, as you can see up here.
And what you can see here is already the metadata Zenodo makes public for every single data set. So, always the title, and every upload will offer a description, which is mandatory. You can include nice features in your description: most of you will probably think that having a link in a description is trivial, but most data archives do not offer this feature, you just get plain text. So including a link or a table or something like that is not too common when you look through standard data archives. In our case, our publication platform at the university library for dissertations, OPUS, doesn't have the feature to include a link in the description that resolves. You can copy and paste it, of course, but it's not clickable. On Zenodo it is.
And then you can see the data set in the preview, in this case all the different files; you can download them, and you can see lots of one gigabyte files, so it's a relatively large data set already. On the right-hand side we see a few other interesting pieces of metadata. In this case it's connected to a publication, in addition it is connected to an arXiv publication and so on, but that's not really the interesting part. What is really interesting is this part here: the version history, and you can see it's version 144.
And Zenodo assigns a single DOI to every version. This is the DOI that's used to cite this particular version, so if anyone uses a data set from this chain of versions, that person can cite exactly: I used this version and no other. And this is the way you should cite a data set, because you want to make sure that everyone knows which data you actually used, for future research. But Zenodo also provides a global, so-called concept DOI, which is always down here. This is essentially a pointer that always resolves to the latest version of the data set, but it is meant to be a representative for the whole stack of versions. And this is very useful if you want to include a link on your web page to something you publish, because you can just link via this DOI and it will always track the latest version. So you can just tell the visitors of your web page: please look at our data set, you can find it here. And "here" will always resolve to the latest version; you don't have to change anything when you upload a new version. So that's quite convenient.
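To make the distinction between the two DOI kinds concrete, here is a small sketch of how you might read both programmatically. It assumes Zenodo's public records API, whose JSON response contains a `doi` field for the version-specific DOI and a `conceptdoi` field for the concept DOI; the record ID and DOI values below are purely illustrative.

```python
import json
import urllib.request


def fetch_record(record_id):
    """Fetch the public metadata of a Zenodo record via the records API."""
    url = f"https://zenodo.org/api/records/{record_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def extract_dois(record):
    """Return (version DOI, concept DOI) from a record's JSON metadata.

    The version DOI cites exactly one version; the concept DOI always
    resolves to the latest version of the whole version chain.
    """
    return record.get("doi"), record.get("conceptdoi")


# Usage sketch (illustrative record ID; any public Zenodo record works):
# record = fetch_record(1234567)
# version_doi, concept_doi = extract_dois(record)
```

In a paper you would cite the version DOI, so readers know exactly which data you used; on a web page you would link the concept DOI, so visitors always land on the latest version.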
So, quite nice. We can look at this metadata in a different way, namely when we upload data, and I'm just doing this now. I have a test upload here, which is essentially empty, so I haven't uploaded any files yet. You can just drag and drop files, or, if you have a very large file and run into problems with that, you can in principle also use the API to deposit something. But I won't show this; I'll just show the basic front-end way to deposit.
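For completeness, a minimal sketch of the API route just mentioned, based on Zenodo's REST deposit endpoint. The access token, titles, and author names are placeholders, and the exact payload fields should be checked against Zenodo's developer documentation; this is a sketch, not the definitive client.

```python
import json
import urllib.request

ZENODO_API = "https://zenodo.org/api/deposit/depositions"


def build_metadata(title, description, creator_names):
    """Assemble a minimal metadata payload for a dataset deposition."""
    return {
        "metadata": {
            "upload_type": "dataset",
            "title": title,
            "description": description,
            "creators": [{"name": name} for name in creator_names],
        }
    }


def create_deposition(token, payload):
    """POST a new deposition; returns the parsed JSON response,
    which includes a 'links' entry with the file-upload bucket URL."""
    req = urllib.request.Request(
        f"{ZENODO_API}?access_token={token}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage sketch (requires a real personal access token):
# payload = build_metadata("My dataset", "A longer description.", ["Doe, Jane"])
# deposition = create_deposition("YOUR_TOKEN", payload)
# Files are then uploaded via PUT requests to deposition["links"]["bucket"].
```

This is mainly useful for very large files or scripted workflows; for everyday uploads the drag-and-drop front end shown here is enough.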
And I will just scroll down, and here you can see the upload type. Zenodo basically takes anything; the most common cases are software, data set, presentation, or publication. In our case I say it's a data set, and then I fill in basic information like the title and authors, and a good description. This one here is far too short; it should really be descriptive.
Presenter: Dr. rer. nat. Jürgen Rohrwild
Access: Open Access
Duration: 00:32:07 min
Recording date: 2022-12-13
Uploaded on: 2022-12-21 18:26:05
Language: en-US
Invited talk at the NHR@FAU HPC Cafe
Speaker: Dr. Jürgen Rohrwild, FAU University Library
Slides: https://hpc.fau.de/files/2022/11/Zenodo_HPC.pdf
Abstract: Zenodo is a European, free-of-charge document and data repository. With its relatively accessible metadata model, it provides a low-threshold solution for a broad range of scientific datasets. Some potentially desirable features include dedicated Zenodo communities, GitHub integration, and versioning. Zenodo’s generic nature also comes with some trade-offs such as limited metadata fields and no collaborative elements. This short presentation will have a closer look at Zenodo’s features and limits.